Perplexity on Reduced Corpora

Author

  • Hayato Kobayashi
Abstract

This paper studies the idea of removing low-frequency words from a corpus, which is a common practice to reduce computational costs, from a theoretical standpoint. Based on the assumption that a corpus follows Zipf’s law, we derive tradeoff formulae of the perplexity of k-gram models and topic models with respect to the size of the reduced vocabulary. In addition, we show an approximate behavior of each formula under certain conditions. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora.
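As a concrete illustration of the setting (not the paper's actual derivation), the following Python sketch samples a synthetic corpus from a Zipfian distribution, keeps only the most frequent words, and measures the perplexity of a maximum-likelihood unigram model on the reduced corpus; all function names and parameter values are hypothetical.

```python
import numpy as np

def zipf_corpus(n_tokens, vocab_size, s=1.0, seed=0):
    """Sample a synthetic corpus whose word frequencies follow Zipf's law
    with exponent s: P(rank r) proportional to 1 / r**s."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, vocab_size + 1)
    probs = ranks ** -s
    probs /= probs.sum()
    return rng.choice(vocab_size, size=n_tokens, p=probs)

def unigram_perplexity(corpus, keep_top_v):
    """Perplexity of an MLE unigram model after restricting the vocabulary
    to the keep_top_v most frequent words (low-frequency words removed)."""
    counts = np.bincount(corpus)
    top = np.argsort(counts)[::-1][:keep_top_v]
    reduced = corpus[np.isin(corpus, top)]
    counts = np.bincount(reduced)
    counts = counts[counts > 0]
    p = counts / counts.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

corpus = zipf_corpus(n_tokens=200_000, vocab_size=10_000)
for v in (10_000, 1_000, 100):
    print(v, round(unigram_perplexity(corpus, v), 1))
```

Shrinking the kept vocabulary lowers the measured perplexity here simply because the remaining distribution is more concentrated; the paper's contribution is characterizing this tradeoff analytically for k-gram and topic models.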


Similar Resources

Language-model optimization by mapping of corpora

It is questionable whether words are really the best basic units for the estimation of stochastic language models; grouping frequent word sequences into phrases can improve language models. More generally, we have investigated various coding schemes for a corpus. In this paper, this is applied to optimizing the perplexity of n-gram language models. In tests on two large corpora (WSJ and BNA) the bigram...
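The coding schemes themselves are not spelled out in this excerpt; the sketch below shows only the generic idea of one greedy re-coding step, merging the most frequent adjacent word pair into a single phrase unit (the helper name is hypothetical).

```python
from collections import Counter

def merge_top_bigram(tokens):
    """One greedy step of corpus re-coding: replace the most frequent
    adjacent word pair with a single phrase token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + "_" + b)  # new phrase unit
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_top_bigram("new york is not new york city".split()))
# ['new_york', 'is', 'not', 'new_york', 'city']
```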


Name Perplexity

The accuracy of a Cross Document Coreference system depends on the amount of context available, which is a parameter that varies greatly from corpus to corpus. This paper presents a statistical model for computing name perplexity classes. For each perplexity class, the prior probability of coreference is estimated. The amount of context required for coreference is controlled by the prior core...


Improving language model perplexity and recognition accuracy for medical dictations via within-domain interpolation with literal and semi-literal corpora

We propose a technique for improving language modeling for automated speech recognition of medical dictations by interpolating finished text (25M words) with small human-generated literal and/or machine-generated semi-literal corpora. By building and testing interpolated language models (ILM) with literal (LILM), semi-literal (SILM) and partial (PILM) corpora, we show that both perplexity and recognition results ...
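As a hedged illustration of within-domain interpolation (the ILM/LILM/SILM/PILM construction itself is not detailed in this excerpt), the sketch below linearly combines two unigram models, P(w) = λ·P_finished(w) + (1−λ)·P_literal(w), and picks the interpolation weight that minimizes held-out perplexity; the toy corpora and the 1e-9 floor for unseen words are assumptions.

```python
import math
from collections import Counter

def unigram(tokens):
    """Maximum-likelihood unigram model from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interp_prob(word, lm_a, lm_b, lam):
    """Linear interpolation of two models: P(w) = lam*Pa(w) + (1-lam)*Pb(w)."""
    return lam * lm_a.get(word, 1e-9) + (1 - lam) * lm_b.get(word, 1e-9)

def perplexity(tokens, lm_a, lm_b, lam):
    log_sum = sum(math.log(interp_prob(w, lm_a, lm_b, lam)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

# Toy "finished text" vs. "literal transcript" corpora (assumptions).
finished = unigram("patient presents with acute chest pain".split())
literal = unigram("uh the patient presents with um chest pain".split())
test = "patient presents with chest pain".split()

# Pick the interpolation weight that minimizes perplexity on held-out text.
best_ppl, best_lam = min(
    (perplexity(test, finished, literal, lam), lam)
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0)
)
print(f"lambda={best_lam}: perplexity={best_ppl:.2f}")
```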


emLam - a Hungarian Language Modeling baseline

This paper aims to make up for the lack of documented baselines for Hungarian language modeling. Various approaches are evaluated on three publicly available Hungarian corpora. Perplexity values comparable to models of similar-sized English corpora are reported. A new, freely downloadable Hungarian benchmark corpus is introduced.


Web data harvesting for speech understanding grammar induction

The development of a speech understanding grammar for spoken dialogue systems can be greatly accelerated by using an in-domain corpus. The development of such a corpus, however, is a slow and expensive process. This paper proposes unsupervised, language-agnostic methods for finding relevant corpora on the web and mining the most informative parts. We show that by utilizing perplexity we are abl...
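A minimal sketch of perplexity-based filtering, assuming an add-one-smoothed unigram model built from a tiny in-domain seed corpus (the paper's actual models and thresholds are not given in this excerpt): web candidates are ranked by their perplexity under the in-domain model, and the lowest-perplexity text is treated as most relevant.

```python
import math
from collections import Counter

def unigram_lm(tokens, alpha=1.0):
    """Add-alpha smoothed unigram model from a small in-domain seed corpus."""
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * (len(counts) + 1)
    return lambda w: (counts.get(w, 0) + alpha) / total

def perplexity(sentence, lm):
    words = sentence.split()
    return math.exp(-sum(math.log(lm(w)) for w in words) / len(words))

seed = "book a flight to boston book a hotel in boston".split()
lm = unigram_lm(seed)
web_candidates = [
    "book a flight to denver",
    "celebrity gossip of the week",
]
# Keep web sentences whose perplexity under the in-domain model is low.
ranked = sorted(web_candidates, key=lambda s: perplexity(s, lm))
print(ranked[0])  # the flight sentence ranks as most in-domain
```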



Publication date: 2014